Task-Based Algorithm for Matrix Multiplication: A Step Towards Block-Sparse Tensor Computing
Authors
Abstract
Distributed-memory matrix multiplication (MM) is a key element of algorithms in many domains (machine learning, quantum physics). Conventional algorithms for dense MM rely on regular/uniform data decomposition to ensure load balance. These traits conflict with the irregular structure (block-sparse, or rank-sparse within blocks) that is increasingly relevant for fast methods in quantum physics. To deal with such irregular data we present a new MM algorithm based on the Scalable Universal Matrix Multiplication Algorithm (SUMMA). The novel features are: (1) multiple-issue scheduling of SUMMA iterations, and (2) a fine-grained task-based formulation. The latter eliminates the need for explicit internodal synchronization; combined with multiple-issue scheduling, this makes it possible to tolerate the load imbalance caused by nonuniform matrix structure. For square MM with uniform and nonuniform block sizes (the latter simulates matrices with general irregular structure) we found excellent performance in the weak- and strong-scaling regimes, on commodity and high-end hardware.
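As an illustration of the iteration structure described above, here is a minimal serial sketch in Python. It is not the authors' implementation: the function and variable names are ours, None entries stand in for absent (block-sparse) blocks, and the (i, j, k) contractions that run sequentially here are exactly the units that become independent tasks in the fine-grained formulation.

import numpy as np

def summa_like_blocked_mm(A_blocks, B_blocks, nb):
    # Serial sketch of SUMMA's iteration structure: iteration k combines
    # block column k of A with block row k of B and accumulates their
    # outer product into C. Blocks may have nonuniform shapes; None
    # marks an absent block (block sparsity).
    C_blocks = [[None] * nb for _ in range(nb)]
    for k in range(nb):
        for i in range(nb):
            for j in range(nb):
                a, b = A_blocks[i][k], B_blocks[k][j]
                if a is None or b is None:
                    continue  # no data, so no work and no task
                # In the task-based formulation each (i, j, k) contraction
                # is an independent task; here they just run in order.
                if C_blocks[i][j] is None:
                    C_blocks[i][j] = a @ b
                else:
                    C_blocks[i][j] += a @ b
    return C_blocks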
Similar resources
Two-dimensional cache-oblivious sparse matrix-vector multiplication
In earlier work, we presented a one-dimensional cache-oblivious sparse matrix–vector (SpMV) multiplication scheme which has its roots in one-dimensional sparse matrix partitioning. Partitioning is often used in distributed-memory parallel computing for the SpMV multiplication, an important kernel in many applications. A logical extension is to move towards using a two-dimensional partitioning. ...
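For reference, the kernel this scheme optimizes is ordinary SpMV; a minimal CSR version is sketched below. The cited work additionally partitions and reorders the matrix for cache-oblivious access, which this plain loop does not show.

import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    # y = A @ x for A in compressed sparse row (CSR) form: row i owns
    # the entries values[row_ptr[i]:row_ptr[i+1]].
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y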
Time Integration of Tensor Trains
A robust and efficient time integrator for dynamical tensor approximation in the tensor train or matrix product state format is presented. The method is based on splitting the projector onto the tangent space of the tensor manifold. The algorithm can be used for updating time-dependent tensors in the given data-sparse tensor train / matrix product state format and for computing an approximate s...
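As background, the tensor train / matrix product state format that the integrator updates can itself be built by successive SVDs. The sketch below is the standard TT-SVD construction, not the projector-splitting integrator of the cited paper, and the names are ours.

import numpy as np

def tt_svd(t, tol=1e-10):
    # Decompose a full ndarray into tensor-train cores of shape
    # (r_prev, n_k, r_k) via successive truncated SVDs.
    dims, cores, r = t.shape, [], 1
    rest = t.reshape(r * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        keep = max(1, int(np.sum(s > tol * s[0])))
        cores.append(u[:, :keep].reshape(r, dims[k], keep))
        rest = (s[:keep, None] * vt[:keep]).reshape(keep * dims[k + 1], -1)
        r = keep
    cores.append(rest.reshape(r, dims[-1], 1))
    return cores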
Fast sparse matrix multiplication on GPU
Sparse matrix multiplication is an important algorithm in a wide variety of problems, including graph algorithms, simulations, and linear solvers, to name a few. Yet there are only a few works on accelerating sparse matrix multiplication on a GPU. We present a fast, novel algorithm for sparse matrix multiplication, outperforming the previous GPU algorithm by up to 3× and CPU implementations by up to 30×...
An Efficient Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats
Tensors, which are the linear-algebraic extensions of matrices in arbitrary dimensions, have numerous applications to data processing tasks in computer science and computational science. Many tensors used in diverse application domains are sparse, typically containing more than 90% zero entries. Efficient computation with sparse tensors hinges on algorithms that can leverage the sparsity to do ...
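The quantity being estimated, the "fill" of a blocked format, has a simple exact definition, shown brute-force below. The cited algorithm estimates this ratio by sampling rather than scanning every block, and the function name here is ours.

import numpy as np

def fill_ratio(A, r, c):
    # Fill of an r x c blocked format: entries stored when every block
    # containing a nonzero is stored densely (boundary blocks padded to
    # full size), divided by the number of actual nonzeros.
    stored = 0
    for bi in range(0, A.shape[0], r):
        for bj in range(0, A.shape[1], c):
            if np.count_nonzero(A[bi:bi + r, bj:bj + c]):
                stored += r * c
    return stored / np.count_nonzero(A)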
Fast Structured Matrix Computations: Tensor Rank and Cohn-Umans Method
We discuss a generalization of the Cohn–Umans method, a potent technique developed for studying the bilinear complexity of matrix multiplication by embedding matrices into an appropriate group algebra. We investigate how the Cohn–Umans method may be used for bilinear operations other than matrix multiplication, with algebras other than group algebras, and we relate it to Strassen’s tensor rank ...
Journal: CoRR
Volume: abs/1504.05046
Issue: -
Pages: -
Publication date: 2015